Joint deep learning for batch effect removal and classification toward MALDI MS based metabolomics

Background Metabolomics is a primary omics topic, which occupies an important position in both clinical applications and basic research on metabolic signatures and biomarkers. Unfortunately, the relevant studies are challenged by the batch effect caused by many external factors. In the last decade, deep learning has become a dominant tool in data science, such that one may train a diagnosis network on a known batch and then generalize it to a new batch. However, the batch effect inevitably hinders such efforts, as the two batches under consideration can be highly mismatched.

Results We propose an end-to-end deep learning framework for joint batch effect removal and classification on metabolomics data. We first validate the proposed framework on a public CyTOF dataset as a simulated experiment. We also visually compare the t-SNE distributions and demonstrate that our method effectively removes the batch effect in latent space. Then, on a private MALDI MS dataset, we achieve the highest diagnostic accuracy, with an average increase of about 5.1% to 7.9% over state-of-the-art methods.

Conclusions Both experiments show that our method classifies significantly better than conventional methods, benefiting from the effective removal of the batch effect.

Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04758-z.

In this study, all classification is performed with deep learning networks. For a fair comparison, the types and numbers of network layers are kept the same before and after calibration. The parameter "MALDI_MS" is a flag for mass spectrometry experiments, so it is set to False for all public CyTOF data. The learning rate, number of epochs, mini-batch size and learning step are set to 10⁻⁴, 100, 200 and 10⁴, respectively. To prevent overfitting, the L2 weight decay is set to 5 × 10⁻⁵ during training.
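The settings above can be gathered into a small configuration object. This is only an illustrative sketch: apart from the "MALDI_MS" flag, every field name below is an assumed placeholder rather than an identifier from the authors' code, and the learning step is stored exactly as stated without interpreting its role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyper-parameters listed in the text."""
    MALDI_MS: bool        # flag named in the text; False for CyTOF data
    learning_rate: float  # assumed field name
    epochs: int           # assumed field name
    batch_size: int       # assumed field name (mini-batch size)
    learning_step: int    # assumed field name; value taken as stated
    weight_decay: float   # assumed field name (L2 weight decay)


# Classification settings for the public CyTOF data, as stated above.
cytof_config = TrainConfig(
    MALDI_MS=False,
    learning_rate=1e-4,
    epochs=100,
    batch_size=200,
    learning_step=10_000,
    weight_decay=5e-5,
)
```

Freezing the dataclass is a design choice for the sketch: it prevents a setting from being changed by accident between the runs that are meant to be compared.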

After calibration
For training our network after calibration, the learning rate is set to 10⁻⁴ and the number of epochs to 2000. The model is trained with mini-batch gradient updates, where three losses are computed from a sampled mini-batch at each training iteration. We set the mini-batch size to 100, and the coefficients of the three losses are set to 0.01, 1 and 0.01, respectively, by grid search. The learning step and L2 weight decay are set to 10⁴ and 5 × 10⁻⁵, respectively.
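As a minimal sketch of how three loss coefficients of this kind are typically applied, the snippet below forms a weighted sum of three scalar loss values for one mini-batch. It assumes the losses are combined additively and that the coefficients apply in the order listed; the function and argument names are hypothetical and the loss values are made up.

```python
def combine_losses(loss_a, loss_b, loss_c, weights=(0.01, 1.0, 0.01)):
    """Weighted sum of three loss terms computed on one mini-batch.

    The default weights are the three coefficients quoted in the text;
    which loss each coefficient belongs to is an assumption here.
    """
    w_a, w_b, w_c = weights
    return w_a * loss_a + w_b * loss_b + w_c * loss_c


# Made-up loss values for a single mini-batch.
example_total = combine_losses(2.0, 0.5, 4.0)
```

With these coefficients the second term dominates the objective, while the other two act as comparatively light regularizing terms.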

In-batch Cross Validation
To compare all results on the same benchmark, the classification network used for in-batch 10-fold cross-validation shares the same framework as the cross-batch experiments. The learning rate, number of epochs, mini-batch size, learning step and L2 weight decay are set to 10⁻⁴, 15, 200, 10³ and 5 × 10⁻⁵, respectively.
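The 10-fold splitting itself can be sketched as follows. This is an illustrative, standard-library version that assumes a simple shuffled split; whether the original experiments shuffle or stratify the folds is not stated, and the function name is hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Return a list of (train_idx, test_idx) pairs for shuffled k-fold CV."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most 1.
    folds = [order[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i contributes to the training indices.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


# Example: 95 samples split into 10 folds.
splits = kfold_indices(95)
```

Each sample appears in exactly one test fold, so averaging the per-fold metrics uses every sample once for evaluation.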

Before calibration
In this set of experiments, the flag variable "MALDI_MS" is set to True for all private MALDI MS data. The ID of the file containing the number of samples in each subject is consistent with the testing set. The structure of the classification network is also consistent with the CyTOF experiments. The learning rate, number of epochs, mini-batch size, learning step and L2 weight decay are set to 10⁻³, 100, 200, 10⁴ and 5 × 10⁻⁵, respectively.

After calibration
For training after calibration, the ID of the file containing the number of samples in each subject is consistent with the testing set. The learning rate, number of epochs, mini-batch size and learning step are set to 10⁻⁴, 2000, 200 and 10⁴, respectively. In addition, the coefficients of the three losses are set to 0.01, 1 and 0.01, respectively, by grid search. The L2 weight decay is set to 5 × 10⁻⁵ during training, the same as in the first experiment.

In-batch Cross Validation
As in the previous experiment, the classification network for in-batch cross-validation has the same structure as the one used for cross-batch prediction. The ID of the file containing the number of samples in each subject is consistent with the training set. The learning rate, number of epochs, mini-batch size, learning step and L2 weight decay are set to 10⁻⁴, 25, 200, 10³ and 5 × 10⁻⁵, respectively.

Information for Other Methods
Open-source code is available for the algorithms used in the comparative experiments. ComBat and fSVA are implemented by the ComBat() and fsva() functions, respectively, in the R package sva (http://bioconductor.org/packages/3.5/bioc/html/sva.html). In addition, commonly used functions for batch effect removal, including the geometric.mean() function adopted by ratio_G, are implemented in the psych R package (http://cran.rproject.org/web/packages/psych/index.html). In principle, ratio-based data are obtained by scaling all samples by ready-made reference samples or by the average of the negative-class samples in each batch. However, neither of these two references is available in practice, because the class labels of the test batch cannot be known before the prediction is performed. We therefore use the mean of all training samples (namely batch 1) as the reference and scale the sample values (intensities) of the other batches by it, which does not introduce a significant performance bias. The source code of the ResNet and NormAE algorithms is publicly available at https://github.com/ushaham/BatchEffectRemoval.git and https://github.com/luyiyun/NormAE, respectively. Because our data are acquired by MALDI MS rather than the LC MS used in NormAE, there is no injection order, so this input is removed from the training and testing process. In addition, mass quality control was conducted using standard molecules at the serum plate stage, so it does not appear in the preprocessing matrix. To ensure convergence of the model, lr_disc_b, epoch and batch_size are set to 0.0005, (100, 10, 100) and 200, respectively; all other parameters keep their default values.
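The ratio-based scaling described above, with the mean of the training batch as the reference, might look like the following sketch. It assumes the reference is a per-feature arithmetic mean and that scaling is a plain element-wise division with no guard against zero references; the function names and the toy intensities are hypothetical.

```python
def feature_means(batch):
    """Per-feature arithmetic mean over all samples (rows) of a batch."""
    n_samples = len(batch)
    return [sum(column) / n_samples for column in zip(*batch)]


def ratio_scale(batch, reference):
    """Divide every feature intensity by the matching reference value."""
    return [[value / ref for value, ref in zip(sample, reference)]
            for sample in batch]


# Toy intensities: rows are samples, columns are m/z features (made-up numbers).
batch1_train = [[2.0, 10.0], [4.0, 30.0]]
batch2_test = [[6.0, 10.0], [3.0, 40.0]]

reference = feature_means(batch1_train)              # [3.0, 20.0]
batch2_scaled = ratio_scale(batch2_test, reference)  # [[2.0, 0.5], [1.0, 2.0]]
```

Because the reference is computed from batch 1 alone, no class labels from the test batch are needed, which is the constraint the text points out.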

Robustness Verification
To verify that the deep learning algorithm we developed can be applied to an entirely new but similar dataset, we collected a new batch of systemic lupus erythematosus (SLE) patients and healthy control (HC) subjects from Renji Hospital, comprising 89 SLE patients and 75 HCs. We use each of the three previous batches in turn as the training set to calibrate the batch effect of this new batch and evaluate its ACC, F-score, AUC and MCC. As shown in Table S1, no matter which batch is used as the training set, these indicators improve significantly after calibration. The permutation test is a computationally intensive method that uses random rearrangement of the sample data for statistical inference. If random or "fake" labels cannot achieve good results, the model is considered robust; otherwise, there is a problem with the model. Taking one old batch (i.e., batch 1) for training and the new batch for testing, we apply the same shuffling pattern as for the old batches in Figure 3. The accuracy values obtained with random labels are bell-shaped and very poor (Figure S3a). In addition, we examined whether the differences in key metabolites remain significant after permuting the measurements of this new batch. As shown in Figure S3b, there is no significant difference in any m/z feature between the case and control groups, which further verifies the robustness of the algorithm in our study.
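The idea behind the permutation test can be sketched in a simplified form. The version below holds a set of predictions fixed and shuffles only the labels they are scored against, which yields a null distribution of accuracies; whether the original experiments instead retrain the network on shuffled labels is not stated here, so this is an assumption. The function names are hypothetical and the predictions are made up, with only the class sizes (89 SLE, 75 HC) taken from the text.

```python
import random


def accuracy(pred, truth):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def permutation_test(pred, truth, n_perm=1000, seed=0):
    """Return (observed accuracy, null accuracies, empirical p-value)."""
    rng = random.Random(seed)
    observed = accuracy(pred, truth)
    null_scores = []
    for _ in range(n_perm):
        shuffled = list(truth)
        rng.shuffle(shuffled)  # break the link between samples and labels
        null_scores.append(accuracy(pred, shuffled))
    # Add-one correction keeps the p-value strictly positive.
    exceed = sum(score >= observed for score in null_scores)
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, null_scores, p_value


# Toy labels sized like the new batch (1 = SLE, 0 = HC); predictions are made up.
truth = [1] * 89 + [0] * 75
pred = list(truth)
for i in (0, 1, 2, 100, 101):  # flip five predictions to mimic a few errors
    pred[i] = 1 - pred[i]

observed, null_scores, p_value = permutation_test(pred, truth)
```

If the observed accuracy lies far in the upper tail of the null accuracies, the classifier's performance is unlikely to be explained by chance agreement with the labels.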